Abstract: The generalization error bound of the support vector
machine (SVM) depends on the ratio of radius and margin, while
standard SVM only maximizes the margin and ignores the
minimization of the radius. Several approaches have been
proposed to integrate radius and margin for joint learning of
feature transformation and SVM classifier. However, most of
them either require the transformation matrix to be diagonal,
or are non-convex and computationally expensive.

In this paper, we suggest a novel approximation for the radius
of the minimum enclosing ball (MEB) in feature space, and then
propose a convex radius-margin based SVM model for joint
learning of feature transformation and SVM classifier, i.e.,
F-SVM. An alternating minimization method is adopted to solve
the F-SVM model, where the feature transformation is updated
via gradient descent and the classifier is updated with an
existing SVM solver. By incorporating kernel principal
component analysis, F-SVM is further extended to joint learning
of nonlinear transformation and classifier. Experimental results
on the UCI machine learning datasets and the LFW face dataset
show that F-SVM outperforms the standard SVM and the
existing radius-margin based SVMs, e.g., RMM, R-SVM+ and
R-SVMμ+.

Index Terms: Support vector machine, radius margin bound,
convex relaxation, max-margin.

I. Introduction

Support vector machine (SVM) and its extensions have
been among the most successful machine learning methods
and have been adopted in various fields, e.g., computer
vision [?], signal processing [?], natural language
processing [9], [10], and bioinformatics [?]. SVM seeks the
optimal hyperplane under the maximum margin principle, but
its generalization error is actually a function of the ratio
of radius and margin [?]. When the feature space is given, the
radius is fixed and can be ignored, so SVM can minimize
the generalization error by maximizing the margin alone. For
joint learning of feature transformation and classifier,
however, the radius changes with the transformation and can
no longer be ignored.

By minimizing the radius-margin ratio, the generalization
error of SVM can be optimized for joint learning of feature
transformation and classifier. Since the radius-margin error
bound is non-convex, relaxation and approximation of the radius
are generally adopted in the existing models [?]. Several
approaches have been proposed from the perspective of the
radius-margin error bound [?], but most of them suffer from
a heavy computational burden or restrict the transformation to
simplified forms. RMM [16] only considers the spread of
the data along the direction perpendicular to the classification
hyperplane. Radius-margin based SVMs, e.g., MR-SVM [?],
R-SVM+ [18] and R-SVMμ+ [18], are based on the constraint
that the linear transformation matrix should be diagonal.

Another strategy for joint feature transformation and
classifier learning is to incorporate metric learning with SVM,
where metric learning can be adopted to learn a better linear
transformation matrix [?]. One simple approach to combining
metric learning and SVM is to directly deploy the transformation
obtained using metric learning into SVM. This approach, however,
usually does not lead to a satisfactory performance improvement
[?]. Therefore, other approaches have been proposed to
integrate metric learning into SVM, e.g., support vector metric
learning (SVML) [27] and metric learning with SVM (MSVM)
[?]. However, SVML [27] was designed for RBF-SVM and ignores
the radius information, while MSVM [?] is non-convex.

In this paper, we propose a novel radius-margin based SVM
model for joint learning of feature transformation and SVM
classifier, i.e., F-SVM. Compared with the existing
radius-margin based SVM methods, we derive novel lower and upper
bounds for the relaxation of the radius-margin ratio. Unlike
MR-SVM [?], R-SVM+ [18] and R-SVMμ+ [18], which are
suggested for joint feature weighting and SVM learning,
F-SVM can simultaneously learn the feature transformation L and
the classifier (w, b). Compared with the existing metric learning
for SVM methods, our F-SVM model considers both the
radius and the margin information, and is convex. An
alternating minimization algorithm is then proposed to solve the
F-SVM model, which iterates by updating the feature
transformation and the classifier alternately. Note that kernel
SVM is equivalent to performing linear SVM in the kernel PCA
space. We therefore conduct linear F-SVM in the kernel PCA space
for joint learning of nonlinear transformation and classifier.
The contribution of this paper is three-fold:

• A novel convex formulation of radius-margin based SVM
model, i.e., F-SVM, is proposed. Unlike MR-SVM [?],
R-SVM+ [18] and R-SVMμ+ [18], our F-SVM is capable
of jointly learning feature transformation and classifier, and
is robust against outliers. Experimental results show that
F-SVM outperforms SVM and the existing radius-margin
based SVMs.

• In F-SVM, we derive the lower and upper bounds for
the radius of the minimum enclosing ball (MEB) in feature
space, resulting in a novel approximation of the radius.
Compared with the approximations proposed in [?], ours
is much simpler and can be easily adopted in developing
radius-margin based SVM models.

• An alternating minimization algorithm is developed for
solving F-SVM by iterating between gradient descent
and SVM learning. Therefore, off-the-shelf SVM
solvers can be employed to improve the computational
efficiency. Moreover, a semi-whitened PCA method is
developed for the initialization of $M = L^T L$.

The remainder of the paper is organized as follows: Section
2 reviews the related work on the radius-margin ratio based
bounds and their applications. Section 3 describes the model
and algorithm of the proposed F-SVM method. Section 4
extends F-SVM to the kernelized version for nonlinear
classification. Section 5 provides the experimental results on the
UCI machine learning datasets and the LFW dataset. Finally,
we conclude the paper in Section 6.

II. Related work 

The radius-margin bound not only provides a theoretical
explanation of the generalization performance of SVM [?],
but has also been extensively adopted for improving kernel
classification methods, e.g., model selection [28], [?],
multiple kernel learning (MKL) [30], [31], [32], [33], and
mapping of nominal attributes [34]. Denote a training set by
$S = \{(x_1, y_1), \dots, (x_n, y_n)\}$ and a feature space by
$\mathcal{H}: \Phi(x)$. In [30], [?], the radius R of the minimum
enclosing ball (MEB) in feature space is computed as:

$$
\min_{R,\, x_0} \; R^2, \quad \text{s.t.} \;\; \|\Phi(x_i) - \Phi(x_0)\|_2^2 \le R^2, \;\; i = 1, 2, \dots, n. \tag{1}
$$

Assuming that the training set is separable, given the optimal
hyperplane (w, b), Vapnik [?] suggested a radius-margin error
bound which showed that the expectation of the misclassification
probability depends on $R^2 \|w\|_2^2$.

The standard SVM is known as a max-margin method which
only considers the margin $1/\|w\|_2$ in the algorithm. When
the feature space is fixed, the radius is a constant and can
thus be ignored. But in many classification tasks, the model
parameters [28], the combination of basis kernels [31], or the
feature reweighting or transformation [?] should be learned
or tuned based on the training data, and integrating the radius
has been demonstrated to be very effective in improving the
classification performance in these settings. In model selection,
the radius-margin bound has been applied for choosing the
tradeoff parameter and the scaling factors of SVM and L1-SVM [28].
In multiple kernel learning (MKL) [30], [31], [32] and feature
reweighting [?], several variants of the radius have been
developed.

This paper aims to jointly learn SVM together with feature
transformation by minimizing the radius-margin ratio, i.e.,
radius-margin based SVM, so a more detailed review is given
on this topic. Except for [?], most existing approaches [?], [18]
require the transformation matrix to be diagonal, i.e., they
perform feature reweighting and selection. Direct use of the
radius-margin ratio $R^2 \|w\|_2^2$ in SVM results in a non-convex
optimization problem, which makes the learning algorithm
computationally expensive and unstable. By restricting the feature
transformation to be diagonal, $D_{\sqrt{\mu}} = \mathrm{Diag}(\sqrt{\mu})$
with $\mu_k \ge 0$, Do et al. [?] suggested that the radius is
bounded by $\max_k \mu_k R_k^2 \le R_\mu^2 \le \sum_k \mu_k R_k^2$,
where $R_k$ is the radius on dimension k. By approximating
$R_\mu^2$ with its upper bound $\sum_k \mu_k R_k^2$, MR-SVM
in [?] solved the following convex relaxation problem:

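$$
\begin{aligned}
\min_{w, b, \xi, \mu} \quad & \frac{1}{2} \sum_k \frac{w_k^2}{\mu_k} + \frac{C}{\sum_k \mu_k R_k^2} \sum_i \xi_i, \\
\text{s.t.} \quad & y_i (w^T x_i + b) \ge 1 - \xi_i, \\
& \xi_i \ge 0, \;\; i = 1, 2, \dots, n, \\
& \sum_k \mu_k = 1, \;\; \mu_k \ge 0, \;\; \forall k.
\end{aligned}
\tag{2}
$$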

Denote Rq by the half value of the maximum pairwise 
distances. Do et al. in m introduced a tighter bound of the 
radius Rq < Rfi < Rq and proposed another convex 
model R-SVM+: 

fj, 


min 

S.t. 


2 ^-^k jUp. 

z/j(w^Xj + 6) >l-ii,'ii, 


Ci >0,z = l,2,--- ,n, 

1 II ||2 


(3) 


Furthermore, R-SVMμ+ was developed in [18] by controlling
both the radius and margin with w.

Rather than feature reweighting and selection, Zhu et al.
[?] proposed a metric learning with SVM (MSVM) method
for joint learning of the linear transformation and SVM
classifier. In [?], given the transformation matrix A, an alternative
$\bar{R}^2 = \max_i \|A x_i - A \bar{x}\|_2^2$ of the squared radius
$R^2$ was adopted, where $\bar{x}$ is the mean of the training
samples. Although Zhu et al. [?] claimed that $\bar{R} = R$, as
demonstrated in Theorem 1 of this work, $\bar{R}$ is only an upper
bound of R. The MSVM model in [?] was formulated as:

$$
\begin{aligned}
\min_{w, b, \xi, A} \quad & \frac{1}{2} \|w\|_2^2 + C \sum_i \xi_i, \\
\text{s.t.} \quad & y_i (w^T A x_i + b) \ge 1 - \xi_i, \\
& \xi_i \ge 0, \;\; i = 1, 2, \dots, n, \\
& \|A x_i - A \bar{x}\|_2^2 \le 1, \;\; \forall i.
\end{aligned}
\tag{4}
$$

Note that MSVM is non-convex and is solved using gradient
projection.

In this paper, we propose a novel relaxed convex model
of radius-margin based SVM, i.e., F-SVM, for joint learning
of feature transformation and SVM classifier. Compared with
the existing radius-margin based SVM methods, F-SVM has
some distinguishing features. Our F-SVM model is convex,
while MSVM [?] is non-convex. Unlike RMM [16], the
transformation in F-SVM is learned to minimize the radius of
the enclosing ball of all samples rather than to only shrink
the sample span along the direction perpendicular to the
hyperplane. Moreover, F-SVM also differs from MR-SVM
[?], R-SVM+ [18] and R-SVMμ+ [18] in three aspects:
(i) Instead of feature reweighting and selection, F-SVM can
learn feature transformation and classifier simultaneously; (ii)
F-SVM adopts a new approximation for the radius of the MEB in
feature space; (iii) In F-SVM, individual inequality constraints
are combined into one holistic inequality constraint to improve
the robustness against outliers. All these make F-SVM very
promising for joint learning of feature transformation and
SVM classifier, and the experimental results also validate the
effectiveness of F-SVM.


III. Radius-margin based Support Vector Machine 


A. Problem Formulation 


Denote by $S = \{(x_1, y_1), \dots, (x_n, y_n)\}$ a training set,
where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$ denote the ith
training sample and the corresponding class label, respectively. By
introducing the slack variables $\xi_i$ (i = 1, 2, ..., n), SVM aims
to find the optimal separating hyperplane by solving the
following optimization problem:

$$
\begin{aligned}
\min_{u, b, \xi} \quad & \frac{1}{2} \|u\|_2^2 + C \sum_{i=1}^{n} \xi_i, \\
\text{s.t.} \quad & y_i (u^T x_i + b) \ge 1 - \xi_i, \\
& \xi_i \ge 0, \;\; i = 1, 2, \dots, n,
\end{aligned}
\tag{5}
$$

where (u, b) are the parameters that describe the learned
hyperplane $u^T x + b = 0$, $\xi_i$ denotes the ith slack variable, and
C stands for the tradeoff parameter. The objective function
in Eq. (5) aims to maximize the margin $\gamma = 1/\|u\|_2$ while
minimizing the empirical risk $\sum_i \xi_i$. For joint learning, we
introduce a linear transformation matrix A and integrate the
radius information, resulting in the following radius-margin
based SVM model:


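$$
\begin{aligned}
\min_{u, b, \xi, A, R} \quad & \frac{1}{2} \|u\|_2^2 R^2 + C \sum_{i=1}^{n} \xi_i, \\
\text{s.t.} \quad & y_i (u^T A x_i + b) \ge 1 - \xi_i, \;\; \forall i, \\
& \xi_i \ge 0, \;\; i = 1, 2, \dots, n,
\end{aligned}
\tag{6}
$$

where the radius R is defined as:

$$
\min_{R,\, x_0} \; R^2, \quad \text{s.t.} \;\; \|A x_i - A x_0\|_2^2 \le R^2, \;\; i = 1, 2, \dots, n. \tag{7}
$$

Note that $R^2$ depends on the matrix A and the problem in Eq. (6)
is non-convex [?]. Denote by $x_0$ the center of all instances,
and by $\bar{R}^2$ the largest squared distance from the center in
the transformed feature space. Let $x_0 = \bar{x} = \frac{1}{n} \sum_i x_i$
and $\bar{R}^2 = \max_i \|A x_i - A \bar{x}\|_2^2$. We prove that the
radius R is bounded by $\bar{R}$.

Theorem 1. The radius R is bounded by $\bar{R}$ as:

$$
\tfrac{1}{2} \bar{R} \le R \le \bar{R}. \tag{8}
$$

Please refer to Appendix A for the proof of Theorem 1. In
[?], Zhu et al. claimed that $R = \bar{R}$. From Theorem 1,
$\bar{R}$ is only an approximation of R, and counterexamples with
$R \ne \bar{R}$ can be easily found. Let $w = A^T u$ and
$M = A^T A$. Since the radius R is upper bounded by $\bar{R}$, we
can approximate R with $\bar{R}$. Because $u^T A x_i = w^T x_i$ and,
for invertible A, $\|u\|_2^2 = w^T M^{-1} w$, the radius-margin
SVM model in Eq. (6) is relaxed into the following formulation:

$$
\begin{aligned}
\min_{w, b, \xi, M, \bar{R}} \quad & F(w, b, \xi, M, \bar{R}) = \frac{1}{2} \bar{R}^2 \left( w^T M^{-1} w \right) + C \sum_{i=1}^{n} \xi_i, \\
\text{s.t.} \quad & y_i (w^T x_i + b) \ge 1 - \xi_i, \\
& \xi_i \ge 0, \;\; i = 1, 2, \dots, n, \\
& (x_i - \bar{x})^T M (x_i - \bar{x}) \le \bar{R}^2, \;\; \forall i.
\end{aligned}
\tag{9}
$$

Theorem 2. The problem in Eq. (9) is equivalent to the
following problem:

$$
\begin{aligned}
\min_{w, b, \xi, M} \quad & L(w, b, \xi, M) = \frac{1}{2} \left( w^T M^{-1} w \right) + C \sum_{i=1}^{n} \xi_i, \\
\text{s.t.} \quad & y_i (w^T x_i + b) \ge 1 - \xi_i, \;\; \forall i, \\
& \xi_i \ge 0, \;\; i = 1, 2, \dots, n, \\
& (x_i - \bar{x})^T M (x_i - \bar{x}) \le 1, \;\; \forall i, \\
& M \succeq 0.
\end{aligned}
\tag{10}
$$

Proof. Denote by $(w, b, \xi, M, \bar{R})$ the optimal solution to the
problem in Eq. (9). Let $\tilde{M} = M / \bar{R}^2$ and $\tilde{R} = 1$.
It is easy to see that $(w, b, \xi, \tilde{M}, \tilde{R})$ is also an
optimal solution to the problem in Eq. (9), because it satisfies all
the constraints and
$F(w, b, \xi, \tilde{M}, \tilde{R}) = F(w, b, \xi, M, \bar{R})$.

Next we show that $(w, b, \xi, \tilde{M})$ is the optimal solution
to the problem in Eq. (10). If $(w, b, \xi, \tilde{M})$ is not
the optimal solution to Eq. (10), there must exist some
$(w^*, b^*, \xi^*, M^*)$ that satisfies all inequality constraints and
$L(w^*, b^*, \xi^*, M^*) < L(w, b, \xi, \tilde{M})$. Then we can define
$R^* = 1$ and have
$F(w^*, b^*, \xi^*, M^*, R^*) < F(w, b, \xi, \tilde{M}, \tilde{R})$, which
contradicts the assumption that $(w, b, \xi, \tilde{M}, \tilde{R})$ is an
optimal solution to Eq. (9). Thus, we can solve the problem
in Eq. (10) for the optimal solution $(w, b, \xi, \tilde{M})$, and then
obtain the optimal solution $(w, b, \xi, \tilde{M}, \tilde{R})$ to Eq. (9). □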
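The relation between the MEB radius and the centroid-based radius can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes a linear feature space, uses a Badoiu-Clarkson style iteration as a stand-in for an exact MEB solver, and all function names are hypothetical. It relies only on two elementary facts: half of the largest pairwise distance never exceeds the MEB radius, and the MEB radius lies between half of the centroid radius and the centroid radius.

```python
import numpy as np

def centroid_radius(X):
    """R_bar: largest distance from the sample mean."""
    return np.linalg.norm(X - X.mean(axis=0), axis=1).max()

def half_max_pairwise(X):
    """Half of the largest pairwise distance (a lower bound on the MEB radius)."""
    diffs = X[:, None, :] - X[None, :, :]
    return 0.5 * np.linalg.norm(diffs, axis=2).max()

def approx_meb_radius(X, iters=2000):
    """Approximate MEB via a Badoiu-Clarkson style iteration (an assumption,
    not the paper's solver). The returned ball encloses every sample, so its
    radius is never smaller than the true MEB radius."""
    c = X[0].astype(float).copy()
    for k in range(1, iters + 1):
        far = X[np.argmax(np.linalg.norm(X - c, axis=1))]
        c += (far - c) / (k + 1)          # move the center toward the farthest point
    return np.linalg.norm(X - c, axis=1).max()

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
R_bar = centroid_radius(X)
R_lo = half_max_pairwise(X)
R_meb = approx_meb_radius(X)
print(R_lo, R_meb, R_bar)
```

On typical data the three printed values are close to each other, which is why a cheap centroid-based surrogate is attractive compared with solving the MEB problem inside every iteration.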


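Without loss of generality, we assume $\bar{R} = 1$ and seek
the corresponding optimal w and M by solving Eq. (10).
Moreover, to make the model robust against outliers and noisy
samples, we combine the individual inequality constraints
$(x_i - \bar{x})^T M (x_i - \bar{x}) \le 1$, i = 1, 2, ..., n, into one
integrated inequality constraint [?], resulting in the following
radius-margin based SVM model:

$$
\begin{aligned}
\min_{w, b, \xi, M} \quad & \frac{1}{2} \left( w^T M^{-1} w \right) + C \sum_{i=1}^{n} \xi_i, \\
\text{s.t.} \quad & y_i (w^T x_i + b) \ge 1 - \xi_i, \\
& \xi_i \ge 0, \;\; i = 1, 2, \dots, n, \\
& \sum_{i=1}^{n} (x_i - \bar{x})^T M (x_i - \bar{x}) \le \kappa, \\
& M \succeq 0.
\end{aligned}
\tag{11}
$$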


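By defining the scatter matrix of the training set
$S = \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T$, the integrated
constraint becomes $\mathrm{tr}(MS) \le \kappa$. Based on the Lagrangian
multiplier method [?], the problem expressed in Eq. (11) can be
equivalently reformulated as the following F-SVM model:

$$
\begin{aligned}
\min_{w, b, \xi, M} \quad & \frac{1}{2} \left( w^T M^{-1} w \right) + C \sum_{i=1}^{n} \xi_i + \rho\, \mathrm{tr}(MS), \\
\text{s.t.} \quad & y_i (w^T x_i + b) \ge 1 - \xi_i, \;\; \forall i, \\
& \xi_i \ge 0, \;\; i = 1, 2, \dots, n, \\
& M \succeq 0,
\end{aligned}
\tag{12}
$$

where $\rho$ is the regularization parameter determined by $\kappa$. In
the following, we state that our F-SVM model is convex.

Theorem 3. The F-SVM model is a convex optimization
problem. The proof can be found in Appendix B.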
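To make the three terms of the F-SVM objective concrete, the sketch below evaluates it for given (w, b, M), with the slack variables replaced by their optimal values, i.e., the hinge losses. This is an illustrative sketch only; the function name `fsvm_objective` and the toy data are assumptions and do not come from the paper.

```python
import numpy as np

def fsvm_objective(w, b, M, X, y, C, rho):
    """0.5 * w^T M^{-1} w + C * sum_i xi_i + rho * tr(M S),
    where xi_i = max(0, 1 - y_i (w^T x_i + b)) and S is the scatter matrix."""
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc                                   # scatter matrix of the training set
    xi = np.maximum(0.0, 1.0 - y * (X @ w + b))     # optimal slacks for fixed (w, b)
    quad = 0.5 * w @ np.linalg.solve(M, w)          # margin term in the metric M
    return quad + C * xi.sum() + rho * np.trace(M @ S)

# Toy check: two points on a line, separated with margin exactly 1.
X = np.array([[0.0, 0.0], [2.0, 0.0]])
y = np.array([-1.0, 1.0])
value = fsvm_objective(np.array([1.0, 0.0]), -1.0, np.eye(2), X, y, C=1.0, rho=0.5)
print(value)  # 0.5 (margin term) + 0 (no slack) + 0.5 * 2 (radius term) = 1.5
```

The radius term grows linearly with M while the margin term shrinks with it, which is the tradeoff that the regularization parameter controls.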
Fig. 1. Illustration of the alternating minimization algorithm for F-SVM. First, a semi-whitened PCA based initialization scheme is used to initialize M. Then,
(w, b) and M are updated alternately. When the algorithm converges, we obtain
both a better matrix M that reduces the radius of the MEB in feature space and a max-margin classifier (w, b).


B. Alternating minimization 

In this section, we propose an efficient alternating
minimization algorithm to solve the proposed F-SVM model,
as illustrated in Fig. 1. First, a semi-whitened PCA based
initialization scheme is used to initialize M. Then,
(w, b) and M are updated alternately. Fixing M, the model can be
reformulated as an SVM in the transformed feature space, and
solved using off-the-shelf SVM solvers to update (w, b).
Fixing (w, b), the gradient descent method [35] is adopted to
update the matrix M. When the algorithm converges, as shown
in Fig. 1, we obtain both a better matrix M that reduces the
radius of the MEB in feature space and a max-margin classifier
(w, b). The alternating optimization procedure is summarized
in Algorithm 1.


Algorithm 1 The alternating minimization algorithm for F-SVM

Require: Training set $\{(x_i, y_i)\}_{i=1}^{n}$.
Ensure: Optimal M and (w, b).
1: k = 1,
2: Initialize $M_k$:
3: $M_k = \frac{1}{\sqrt{\tau'}} U \Sigma U^T$,
4: $\Sigma = \mathrm{diag}\{\lambda_1^{-1/2}, \dots, \lambda_d^{-1/2}\}$,
5: repeat
6: // Lines 7-9: updating (w, b).
7: Do eigenvalue decomposition on $M_k$: $M_k = V \Sigma V^T$,
8: Perform linear transformation on $x_i$: $z_i = \Sigma^{1/2} V^T x_i$,
9: Update the SVM classifier (w, b) based on Z,
10: // Lines 11-18: updating M.
11: while not converged do
12: $M = M_k$, t = 1,
13: Compute the gradient of M:
14: $\nabla f(M) = -\frac{1}{2} M^{-1} w w^T M^{-1} + \rho S$,
15: Update M: $M = P_{S^+}(M - t \nabla f(M))$,
16: Update the stepsize t: $t \leftarrow \beta t$,
17: end while
18: $M_{k+1} = M$,
19: $k \leftarrow k + 1$,
20: until M and (w, b) converge
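The alternating procedure can be sketched end to end in a few lines. The code below is a simplified illustration under stated assumptions, not the authors' implementation: `train_linear_svm` is a hypothetical subgradient stand-in for an off-the-shelf SVM solver, the initialization uses a scale factor of 1, the step size is shrunk only when a projected step fails to decrease the M-subproblem objective (a backtracking variant of the step-size rule), and the projection keeps a small eigenvalue floor so that M stays invertible.

```python
import numpy as np

def train_linear_svm(Z, y, C=1.0, epochs=300, lr=0.01):
    """Hypothetical stand-in for an off-the-shelf solver:
    plain subgradient descent on the primal soft-margin objective."""
    v, b = np.zeros(Z.shape[1]), 0.0
    for _ in range(epochs):
        active = y * (Z @ v + b) < 1.0              # samples with nonzero hinge loss
        grad_v = v - C * (y[active, None] * Z[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        v, b = v - lr * grad_v, b - lr * grad_b
    return v, b

def project_psd(M, floor=1e-8):
    """Symmetrize and clip eigenvalues (the small floor is an assumption)."""
    lam, V = np.linalg.eigh(0.5 * (M + M.T))
    return (V * np.maximum(lam, floor)) @ V.T

def m_objective(M, w, S, rho):
    """Objective of the subproblem on M for fixed w."""
    return 0.5 * w @ np.linalg.solve(M, w) + rho * np.trace(M @ S)

def fit_fsvm(X, y, C=1.0, rho=1.0, outer=5, inner=40, beta=0.5):
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc                                   # scatter matrix
    lam, U = np.linalg.eigh(S)
    M = (U * np.maximum(lam, 1e-8) ** -0.5) @ U.T   # semi-whitened init, scale 1
    for _ in range(outer):
        # (w, b) step: linear SVM on z_i = Sigma^{1/2} V^T x_i with M = V Sigma V^T
        mu, V = np.linalg.eigh(M)
        root = np.sqrt(mu)
        v, b = train_linear_svm((X @ V) * root, y, C)
        w = V @ (root * v)                          # map the classifier back to x-space
        # M step: projected gradient, shrinking t when a step does not help
        t = 1.0
        for _ in range(inner):
            u = np.linalg.solve(M, w)               # M^{-1} w
            grad = -0.5 * np.outer(u, u) + rho * S
            cand = project_psd(M - t * grad)
            if m_objective(cand, w, S, rho) < m_objective(M, w, S, rho):
                M = cand
            else:
                t *= beta
    return w, b, M

# Toy usage on two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
y = np.repeat([-1.0, 1.0], 20)
X = rng.normal(size=(40, 2)) + 3.0 * y[:, None]
w, b, M = fit_fsvm(X, y)
acc = np.mean(np.sign(X @ w + b) == y)
```

In practice the inner solver would be replaced by a production SVM package, which is the main point of the alternating design: only the update of M needs custom code.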


1) Initialization of M: Because the proposed F-SVM model
is convex, alternating minimization can converge to the global
optimum for any initialization of M and (w, b), but a proper
initialization helps to improve the computational efficiency.
Thus, by further relaxing the F-SVM model in Eq. (12), we
propose a semi-whitened PCA based initialization method for M.

Note that $w^T M^{-1} w$ is upper bounded by [36]:

$$
\begin{aligned}
w^T M^{-1} w &= \mathrm{tr}\left( w w^T M^{-1} \right) \\
&\le \|w\|_2^2 \, \|M^{-1}\|_2 \\
&\le \|w\|_2^2 \, \|M^{-1}\|_*,
\end{aligned}
\tag{13}
$$

where $\|A\|_2$ and $\|A\|_*$ denote the L2-norm and the nuclear
norm of a positive semi-definite matrix A, respectively. Based
on Eq. (12) and Eq. (13), by setting $B = M^{-1}$, the subproblem
on M can be rewritten as the following problem on B:

$$
\min_{B} \; L(B) = \|B\|_* + \tau'\, \mathrm{tr}\left( B^{-1} S \right), \quad \text{s.t.} \;\; B \succeq 0, \tag{14}
$$

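where $\tau' = \rho / \|w\|_2^2$. The eigenvalue decomposition of S is
$S = U \Lambda U^T$, where $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_d)$
($\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d > 0$), and $\lambda_i$ and the ith
column of U denote the ith eigenvalue and eigenvector, respectively.
With U and $\Lambda$, we define $\hat{B}$ as:

$$
\hat{B} = U \hat{\Sigma} U^T, \quad \hat{\Sigma} = \mathrm{diag}\{(\tau' \lambda_1)^{1/2}, \dots, (\tau' \lambda_d)^{1/2}\}. \tag{15}
$$

Theorem 4. Given an SPD matrix S and $\tau' > 0$, $\hat{B}$ defined
in Eq. (15) is the optimal solution to the problem:

$$
\hat{B} = \arg\min_{B} \left\{ L(B, \tau') = \|B\|_* + \tau'\, \mathrm{tr}\left( B^{-1} S \right) \right\}. \tag{16}
$$

The proof can be found in Appendix C. With $\hat{B}$, the
initialization of M is then defined as $M_0 = \hat{B}^{-1}$, i.e.:

$$
M_0 = \frac{1}{\sqrt{\tau'}} U \Sigma U^T, \quad \Sigma = \mathrm{diag}\{\lambda_1^{-1/2}, \dots, \lambda_d^{-1/2}\}. \tag{17}
$$

Note that we assume that $\|w\|_2^2$ is known for the initialization
of M. From Eq. (17), $\|w\|_2^2$ only affects the scale factor
$\sqrt{\tau'}$ of the linear transformation. Thus, we simply let
$\|w\|_2^2 = 1$ in our implementation.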
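The initialization can be written down directly from the eigendecomposition of the scatter matrix. The sketch below assumes that the initial M is the inverse of the optimal B of the relaxed problem, so that it equals $U (\tau' \Lambda)^{-1/2} U^T$; the function name, the eigenvalue floor, and the toy data are hypothetical and only serve the illustration.

```python
import numpy as np

def semi_whitened_init(X, tau=1.0, floor=1e-12):
    """Return M0 = U (tau * Lambda)^{-1/2} U^T, a factor L0 with L0^T L0 = M0,
    and the scatter matrix S = U Lambda U^T."""
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc
    lam, U = np.linalg.eigh(S)
    lam = np.maximum(lam, floor)                    # guard against zero eigenvalues
    M0 = (U * (tau * lam) ** -0.5) @ U.T            # scales the columns of U
    L0 = ((tau * lam) ** -0.25)[:, None] * U.T      # semi-whitening transformation
    return M0, L0, S

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)) * np.array([1.0, 3.0, 0.5])
tau = 0.7
M0, L0, S = semi_whitened_init(X, tau)
```

A quick sanity check is the first-order optimality condition of the relaxed problem: with B the inverse of M0, the identity `tau * M0 @ S @ M0 = I` should hold, and `L0.T @ L0` should reproduce M0.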

It is interesting to point out that $M_0$ in Eq. (17) implies
a semi-whitening PCA transformation because, up to the scale
factor, $M_0 = U \Lambda^{-1/2} U^T$, where the linear transformation
can then be defined as $L_0 = \Lambda^{-1/4} U^T$. In the literature [?],
$L = \Lambda^{-1/2} U^T$ is called the PCA whitening transformation,
and data whitening has been widely exploited in many applications,
e.g., face recognition, object detection, and image classification
[38], [?], [40], [?]. Considering its connection with data
whitening, semi-whitening is also expected to be effective in other
tasks and applications.

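2) The subproblem on (w, b): Given M, the F-SVM model
can be formulated as:

$$
\begin{aligned}
\min_{w, b, \xi} \quad & \frac{1}{2} w^T B w + C \sum_{i=1}^{n} \xi_i, \\
\text{s.t.} \quad & y_i (w^T x_i + b) \ge 1 - \xi_i, \;\; \forall i, \\
& \xi_i \ge 0, \;\; i = 1, 2, \dots, n,
\end{aligned}
\tag{18}
$$

where $B = M^{-1}$. The eigenvalue decomposition of B is
$B = V \Sigma V^T$. By introducing $B = L^T L$, the transformation matrix
L can be written as $L = \Sigma^{1/2} V^T$. Let $z_i = \Sigma^{-1/2} V^T x_i$
and $v = L w$, so that $w^T B w = \|v\|_2^2$ and $w^T x_i = v^T z_i$.
The problem in Eq. (18) can then be reformulated as:

$$
\begin{aligned}
\min_{v, b, \xi} \quad & \frac{1}{2} \|v\|_2^2 + C \sum_{i=1}^{n} \xi_i, \\
\text{s.t.} \quad & y_i (v^T z_i + b) \ge 1 - \xi_i, \\
& \xi_i \ge 0, \;\; i = 1, 2, \dots, n,
\end{aligned}
\tag{19}
$$

which can be solved using off-the-shelf SVM solvers.
Given the solution v, $w = V \Sigma^{-1/2} v$ can then be obtained.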
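The change of variables in this subproblem is easy to verify numerically. The sketch below draws an arbitrary symmetric positive definite M (an assumption made only for the check) and confirms that the quadratic term, the decision values, and the back-substitution all agree between the original and the transformed problem.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
A = rng.normal(size=(d, d))
M = A @ A.T + np.eye(d)             # arbitrary SPD matrix standing in for M
B = np.linalg.inv(M)                # B = M^{-1}

sig, V = np.linalg.eigh(B)          # B = V diag(sig) V^T
L = np.sqrt(sig)[:, None] * V.T     # L = Sigma^{1/2} V^T, so that B = L^T L

x = rng.normal(size=d)              # a sample in the original space
w = rng.normal(size=d)              # a classifier in the original space
z = (V.T @ x) / np.sqrt(sig)        # z = Sigma^{-1/2} V^T x
v = L @ w                           # v = L w
w_back = V @ (v / np.sqrt(sig))     # w = V Sigma^{-1/2} v
```

Because the decision value and the regularizer are both preserved, any standard linear SVM trained on the transformed samples yields the solution of the original subproblem after mapping v back to w.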

3) The subproblem on M: Given (w, b), the subproblem
on M can be reformulated as:

$$
\min_{M} \; f(M) = \frac{1}{2} \left( w^T M^{-1} w \right) + \rho\, \mathrm{tr}(MS), \quad \text{s.t.} \;\; M \succeq 0. \tag{20}
$$

Since the objective function in Eq. (20) is convex and
differentiable with respect to M, the gradient projection method
[35] is adopted to update M. According to [42], the gradient
of f(M) can be obtained by:

$$
\nabla f(M) = -\frac{1}{2} M^{-1} w w^T M^{-1} + \rho S. \tag{21}
$$

As presented in Algorithm 1, we use gradient projection

$$
M = P_{S^+}\left( M - t \nabla f(M) \right) \tag{22}
$$

to update M by choosing a proper stepsize t and gradually
decreasing it along with the iterations, where $P_{S^+}(\cdot)$ projects a
matrix onto the cone of positive semidefinite matrices.
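The gradient formula and the projection step can be checked with a few lines of NumPy. This is a sketch under the assumption that M is symmetric positive definite at the point where the gradient is evaluated; the helper names are hypothetical. The finite-difference test compares the directional derivative along a random symmetric direction with the inner product of the analytic gradient and that direction.

```python
import numpy as np

def f(M, w, S, rho):
    """Objective of the subproblem on M."""
    return 0.5 * w @ np.linalg.solve(M, w) + rho * np.trace(M @ S)

def grad_f(M, w, S, rho):
    """Analytic gradient: -0.5 * M^{-1} w w^T M^{-1} + rho * S (M symmetric)."""
    u = np.linalg.solve(M, w)                       # u = M^{-1} w
    return -0.5 * np.outer(u, u) + rho * S

def project_psd(M):
    """Euclidean projection onto the PSD cone: clip negative eigenvalues."""
    lam, V = np.linalg.eigh(0.5 * (M + M.T))
    return (V * np.maximum(lam, 0.0)) @ V.T

rng = np.random.default_rng(2)
d, rho, h = 3, 0.3, 1e-6
A = rng.normal(size=(d, d))
M = A @ A.T + np.eye(d)                             # SPD test point
Xc = rng.normal(size=(20, d))
S = Xc.T @ Xc                                       # stand-in scatter matrix
w = rng.normal(size=d)
D = rng.normal(size=(d, d))
D = 0.5 * (D + D.T)                                 # random symmetric direction

numeric = (f(M + h * D, w, S, rho) - f(M - h * D, w, S, rho)) / (2 * h)
analytic = np.sum(grad_f(M, w, S, rho) * D)
P = project_psd(np.array([[1.0, 0.0], [0.0, -2.0]]))
```

The projection simply zeroes out negative eigenvalues; in an implementation that needs the inverse of M afterwards, a small positive floor may be preferable.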


C. Discussion 

The proposed F-SVM method has several advantages
compared with the other radius-margin based
SVMs, e.g., RMM [16], MR-SVM [?], R-SVM+ [18],
R-SVMμ+ [18] and MSVM [?]. RMM [16] is suggested to
maximize the margin while restricting the spread of the data along
the direction perpendicular to the separating hyperplane, while
our F-SVM is proposed to minimize the convex relaxation of
the radius-margin ratio. The generalization error is bounded
by the ratio of radius and margin, and the radius is determined
by the spread along all possible directions rather than only the
direction perpendicular to the separating hyperplane, making
F-SVM theoretically more promising.

MR-SVM [?], R-SVM+ [18] and R-SVMμ+ [18] aim to
learn the diagonal feature transformation
$D_{\sqrt{\mu}} = \mathrm{Diag}(\sqrt{\mu})$ with $\mu_k \ge 0$, while F-SVM is
developed for joint learning of a general feature transformation
and the SVM classifier. Both R-SVM+ and R-SVMμ+ need to solve a
Quadratically Constrained Quadratic Programming (QCQP) problem,
which is computationally more expensive than the alternating
minimization method used in our F-SVM. Moreover, R-SVM+ and
R-SVMμ+ adopted a tighter approximation $R_O$ of the radius. In
F-SVM, a new approximation $\bar{R}$ of the radius is proposed, which
is also tighter than that used in MR-SVM [?]. In addition,
the individual inequality constraints on $\bar{R}$ are combined to
improve the robustness against outliers. It is interesting to note
that we have:

T. “ Xj)^]VI(xj - Xj) =tr (MSt) =4n {tr (MS)), (23) 

where St = (x^ — xj)(xt — x^)^. Eq. (1231) indi¬ 

cates that, if all the inequality constraints on Rq, i.e., 
Il^^xt — D^Xjll < r, are combined into one integrated 
inequality constraint: 

^(Xi - - Xj) = tr < k'. (24) 

Let n' = 4n/^ and M = D^D^. One can see that the 
integrated inequality constraint will be equivalent with that 
adopted in Eq. (HB. 

MSVM [?] was developed for simultaneous learning of
the linear transformation and SVM classifier, but the MSVM
model is non-convex and is solved using gradient projection.
Moreover, although Zhu et al. [?] claimed that $R = \bar{R}$, as
discussed in Section 3.1, $\bar{R}$ is only an upper bound of R and
counterexamples with $R \ne \bar{R}$ can be easily found.
Compared with MSVM [?], the F-SVM model is convex and
robust against noise and outliers, and can be efficiently solved
using the optimization method introduced in Section 3.2.

IV. Kernelization of F-SVM 

With the incorporation of kernel principal component
analysis, linear F-SVM can be extended to a kernel version for
nonlinear classification. First, we show that kernel SVM is
equivalent to performing linear SVM in the kernel PCA space.
Then, kernel F-SVM is introduced by conducting linear
F-SVM in the kernel PCA space.

Let the kernel function be $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$,
where $\varphi(x)$ defines an implicit mapping from the data space to a
high or infinite dimensional feature space. For the training set
$S = \{(x_1, y_1), \dots, (x_n, y_n)\}$, we use $W = [w_1, w_2, \dots, w_d]$
to denote all the PCA eigenvectors corresponding to positive
eigenvalues. Let $\bar{W}$ be a set of basis vectors in the
complementary space of W. Assuming the training set is centered,
for any $x_i$, we have $\bar{W}^T \varphi(x_i) = 0$, and thus obtain:

$$
\begin{aligned}
K(x_i, x_j) &= \varphi(x_i)^T W W^T \varphi(x_j) + \varphi(x_i)^T \bar{W} \bar{W}^T \varphi(x_j) \\
&= \varphi(x_i)^T W W^T \varphi(x_j).
\end{aligned}
\tag{25}
$$
TABLE I

Summary of the UCI datasets used in the experiments.

Dataset          # of samples   # of classes   # of attributes
Breast cancer    286            2              9
Diabetes         768            2              8
Solar Flare      144            3              9
German           1000           2              20
Heart            270            2              13
Image            2310           7              19
Ringnorm         7400           2              20
splice           3190           3              60
Thyroid          215            3              5
Twonorm          7400           2              20
Waveform         5000           3              21


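Let $f_i = W^T \varphi(x_i)$. The dual problem of SVM in the kernel
PCA space can be formulated as:

$$
\begin{aligned}
\max_{\alpha} \quad & \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \langle f_i, f_j \rangle, \\
\text{s.t.} \quad & \sum_{i=1}^{n} \alpha_i y_i = 0, \\
& 0 \le \alpha_i \le C, \;\; i = 1, \dots, n,
\end{aligned}
\tag{26}
$$

where $\langle f_i, f_j \rangle = K(x_i, x_j)$. Therefore, kernel SVM is
equivalent to performing linear SVM in the kernel PCA space. To
extend F-SVM to its kernelized version, we first project each
training sample $x_i$ to the kernel PCA space, $f_i = W^T \varphi(x_i)$,
and then solve the following F-SVM model:

$$
\begin{aligned}
\min_{w, b, \xi, M} \quad & \frac{1}{2} \left( w^T M^{-1} w \right) + C \sum_{i=1}^{n} \xi_i + \rho\, \mathrm{tr}(M S_f), \\
\text{s.t.} \quad & y_i (w^T f_i + b) \ge 1 - \xi_i, \;\; \forall i, \\
& \xi_i \ge 0, \;\; i = 1, \dots, n, \\
& M \succeq 0,
\end{aligned}
\tag{27}
$$

where $S_f = \sum_{i=1}^{n} f_i f_i^T$ is the scatter matrix of the
(centered) projected samples. Algorithm 1 can be adopted to solve
the model in Eq. (27). In our implementation, instead of using
all the eigenvectors, we also consider employing only the PCA
eigenvectors corresponding to the d largest eigenvalues, i.e.,
$W = [w_1, w_2, \dots, w_d]$, and Section 5.2 reports the empirical
result on the influence of d on classification accuracy.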
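The projection to the kernel PCA space only needs the eigendecomposition of the centered kernel matrix. The following sketch illustrates this; the Gaussian kernel parameterization $\exp(-\|x - x'\|^2 / (2\sigma^2))$, the function names, and the relative eigenvalue threshold are assumptions made for the example. It verifies that inner products of the projected samples reproduce the centered kernel, which is the property behind the equivalence stated above.

```python
import numpy as np

def rbf_kernel(X, sigma):
    """Gaussian RBF kernel matrix (this parameterization is an assumption)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def kpca_features(K, d=None, tol=1e-10):
    """Rows of F are f_i = W^T phi(x_i) for the centered feature map."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ K @ H                                  # kernel of the centered features
    lam, A = np.linalg.eigh(Kc)
    order = np.argsort(lam)[::-1]                   # sort eigenvalues in descending order
    lam, A = lam[order], A[:, order]
    keep = lam > tol * lam[0]                       # keep (numerically) positive eigenvalues
    lam, A = lam[keep], A[:, keep]
    if d is not None:                               # optional truncation to d components
        lam, A = lam[:d], A[:, :d]
    return A * np.sqrt(lam), Kc

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
K = rbf_kernel(X, sigma=2.0)
F, Kc = kpca_features(K)
F3, _ = kpca_features(K, d=3)
```

The rows of F can be fed to any linear method; truncating to the leading d components gives the reduced kernel PCA space whose dimension is studied in the experiments.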


V. Experiments 

In this section, we use both the UCI machine learning
datasets and the Labeled Faces in the Wild (LFW) database
to evaluate the proposed F-SVM method, and compare
F-SVM with the competing methods, including SVM and several
representative radius-margin based SVM methods, i.e., RMM
[16], R-SVM+ [18] and R-SVMμ+ [18]. MR-SVM [?] and
MSVM [?] are not considered in our experiments because
their source codes are not publicly available. 10-fold cross
validation (CV) is adopted to determine the optimal values
of the hyper-parameters for each method. The mean classification
accuracy is obtained by averaging over 100 runs of the 10-fold
CV. The methods are evaluated by two performance indicators:
accuracy and training time (seconds, s).


A. Evaluation on linear F-SVM using the UCI datasets 

We evaluate the performance of linear F-SVM on the 11 
datasets from the UCI machine learning repository, where the 


TABLE II

Comparison of the average classification accuracy (%) of
linear SVM, linear RMM [16], linear R-SVM+ [18], linear
R-SVMμ+ [18], and linear F-SVM.

Dataset          SVM     RMM     R-SVM+   R-SVMμ+   F-SVM
Breast cancer    71.40   70.19   71.54    71.14     71.68
Diabetes         76.57   76.29   76.67    76.42     77.00
Solar Flare      67.66   67.38   67.66    67.66     67.69
German           75.58   75.99   76.01    75.87     76.04
Heart            83.61   83.64   83.83    83.96     84.02
Image            83.77   84.12   84.39    83.97     84.32
Ringnorm         75.41   75.78   75.63    75.43     77.05
splice           84.54   85.05   84.67    84.74     84.81
Thyroid          89.76   91.19   91.23    90.09     86.81
Twonorm          96.92   97.79   97.41    97.39     97.08
Waveform         86.95   88.54   88.51    86.88     86.76


TABLE III

Comparison of the training time (s) of linear SVM, linear
RMM [16], linear R-SVM+ [18], linear R-SVMμ+ [18], and linear
F-SVM.


Dataset 

SVM 

RMM 

R-SVM+ 

R-SVM+ 

F-SVM 

Breast cancer 

6.20x10“^ 

3.70x1(L^ 

2.29x10^^ 

0.20x10+^ 

6.60x10“^ 

Diabetes 

1.12x10“^ 

2.63x1(L^ 

2.80xl(f-^ 

1.20x1(L^ 

1.38x10“^ 

Solar Flare 

7.79x10“^ 

1.38x1(L^ 

3.41x10^^ 

4.61x10^ 

6.60x10“^ 

German 

4.83x10“^ 

3.77x1(L^ 

7.38x10^^ 

1.55x1(L^ 

5.95x10“^ 

Heart 

1.50x10“^ 

3.06x1(L^ 

5.09x10^^ 

0.29x10^ 

1.80x10“^ 

Image 

2.29x10^ 

2.21x1(L^ 

5.79x10^^ 

1.48x1(L^ 

2.40x10^ 

Ringnorm 

1.42x10^^ 

7.46x1(L^ 

2.63x10^ 

1.13x1(L^ 

1.42x1(L^ 

splice 

4.02x10^^ 

2.77x1(L^ 

1.09x10^ 

5.69x1(L^ 

1.12x10^ 

Thyroid 

8.70x10“^ 

2.06x1(L^ 

2.79x10^^ 

0.21x10^ 

1.59x10“^ 

Twonorm 

5.99x10^ 

3.14x1(L^ 

9.91x10^^ 

I.IOxKL^ 

3.16x10“^ 

Waveform 

1.30x10“^ 

6.54x10^ 

5.20x10^^ 

1.19x1CL^ 

6.71x10“^ 


reason to choose them is that they have been widely adopted
for evaluating SVM and kernel methods [43], [?], [?]. Table
I provides a brief summary of these UCI datasets, which
include 6 two-class problems and 5 multi-class problems. Tables
II and III list the mean classification accuracy and training
time of five linear classifiers, i.e., linear SVM, linear RMM
[16], linear R-SVM+ [18], linear R-SVMμ+ [18], and linear
F-SVM. RMM [16], R-SVM+ [18], R-SVMμ+ [18], and F-SVM
consider both margin and radius information, while SVM only
considers the margin. As shown in Table II, the radius-margin
based SVM methods generally outperform SVM in terms of
classification accuracy, which indicates that the incorporation
of the radius can improve the classification performance. As listed
in Table III, the training time of SVM is less than that of the
other four methods, indicating that the introduction of the radius
makes the model more complex to train.

We further compare linear F-SVM with the competing
methods. From Table II, F-SVM achieves higher classification
accuracy than SVM on 9, RMM [16] on 7, R-SVM+ [18]
on 7, and R-SVMμ+ [18] on 8 of the 11 datasets. The better
classification accuracy of F-SVM can be attributed to three
factors: (i) compared with SVM, F-SVM incorporates the radius
in the convex model; (ii) unlike RMM [16], F-SVM
considers the spread along all directions rather than only
the direction perpendicular to the separating hyperplane; (iii)
instead of the feature reweighting and selection in R-SVM+ [18]
and R-SVMμ+ [18], a general linear transformation is learned
in F-SVM. To improve the efficiency of F-SVM in training,
we adopt a warm-start strategy, where the solution (w, b) of
the previous iteration is used as the initialization of the next
iteration. From Table III, one can see that F-SVM is only
a little slower than SVM, but is much more efficient in training
than the other competing methods, i.e., RMM [16], R-SVM+ [18]
and R-SVMμ+ [18]. In summary, F-SVM obtains the best
classification accuracy on 6 of the 11 datasets, and is more
efficient in training than the other radius-margin based SVM
methods.
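The win counts quoted above can be recomputed directly from the accuracies in Table II. The short script below is only a bookkeeping aid; the dictionary and variable names are hypothetical, while the numbers are copied from the table.

```python
# Accuracy (%) per dataset: (SVM, RMM, R-SVM+, R-SVMmu+, F-SVM), from Table II.
acc = {
    "Breast cancer": (71.40, 70.19, 71.54, 71.14, 71.68),
    "Diabetes":      (76.57, 76.29, 76.67, 76.42, 77.00),
    "Solar Flare":   (67.66, 67.38, 67.66, 67.66, 67.69),
    "German":        (75.58, 75.99, 76.01, 75.87, 76.04),
    "Heart":         (83.61, 83.64, 83.83, 83.96, 84.02),
    "Image":         (83.77, 84.12, 84.39, 83.97, 84.32),
    "Ringnorm":      (75.41, 75.78, 75.63, 75.43, 77.05),
    "splice":        (84.54, 85.05, 84.67, 84.74, 84.81),
    "Thyroid":       (89.76, 91.19, 91.23, 90.09, 86.81),
    "Twonorm":       (96.92, 97.79, 97.41, 97.39, 97.08),
    "Waveform":      (86.95, 88.54, 88.51, 86.88, 86.76),
}
baselines = ("SVM", "RMM", "R-SVM+", "R-SVMmu+")

# Number of datasets on which F-SVM is strictly more accurate than each baseline.
wins = {name: sum(row[4] > row[i] for row in acc.values())
        for i, name in enumerate(baselines)}

# Number of datasets on which F-SVM is strictly the best of the five methods.
best = sum(row[4] > max(row[:4]) for row in acc.values())

print(wins)  # {'SVM': 9, 'RMM': 7, 'R-SVM+': 7, 'R-SVMmu+': 8}
print(best)  # 6
```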


TABLE IV

Comparison of the average classification accuracy (%) of
kernel SVM, kernel RMM [16], kernel R-SVM+ [18], kernel
R-SVMμ+ [18], and kernel F-SVM.

Dataset          SVM     RMM     R-SVM+   R-SVMμ+   F-SVM
Breast cancer    73.74   74.15   71.54    71.14     73.95
Diabetes         76.83   74.97   77.05    76.80     78.84
Solar Flare      67.64   66.33   67.66    67.54     67.66
German           76.36   76.58   76.01    75.93     76.90
Heart            83.43   82.19   83.83    83.96     84.25
Image            97.14   96.41   84.39    83.97     96.93
Ringnorm         98.41   86.16   75.63    75.43     98.58
splice           90.16   89.34   84.85    84.97     90.55
Thyroid          95.91   95.86   91.23    91.31     96.13
Twonorm          97.59   95.43   97.41    97.39     97.79
Waveform         89.75   92.60   88.51    89.38     90.95



B. Evaluation on kernel F-SVM using the UCI datasets 

In this subsection, we evaluate the performance of kernel
F-SVM on the 11 UCI datasets. The Gaussian RBF kernel is
adopted in our experiments, which introduces an extra kernel
parameter σ. As discussed in Section 4, we also consider the
number of kernel PCA components in kernel F-SVM. Using
four datasets, i.e., Breast cancer, Thyroid, Heart, and German,
Fig. 2 illustrates the classification accuracy of kernel SVM and
kernel F-SVM under different kernel PCA dimensions. It is
interesting to note that properly decreasing the kernel PCA
dimension can consistently improve the classification accuracy
of both kernel SVM and kernel F-SVM. From Fig. 2 one can
also see that kernel F-SVM is superior to kernel SVM
under different dimensions. One possible explanation is
that decreasing the kernel PCA dimension makes the
learned transformation more stable.
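
The kernel PCA projection that precedes the classifier can be sketched as follows. This is only an illustrative sketch, not the implementation used in the experiments: it assumes the Gaussian RBF kernel is parameterized as exp(-||x - y||^2 / (2σ^2)), which this section does not state, and the function names (`rbf_kernel`, `kernel_pca_fit`, `kernel_pca_transform`) are hypothetical. The classifier trained on the projected features is omitted.

```python
import numpy as np


def rbf_kernel(X, Y, sigma):
    """Gaussian RBF kernel, assumed form exp(-||x - y||^2 / (2 sigma^2))."""
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))


def kernel_pca_fit(X, sigma, d):
    """Fit kernel PCA on X and keep at most d leading components."""
    n = X.shape[0]
    K = rbf_kernel(X, X, sigma)
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    Kc = H @ K @ H                               # centered Gram matrix
    evals, evecs = np.linalg.eigh(Kc)            # eigenvalues in ascending order
    idx = np.argsort(evals)[::-1][:d]            # indices of the d largest
    evals, evecs = evals[idx], evecs[:, idx]
    keep = evals > 1e-10                         # drop numerically null directions
    # Scale so that each feature-space principal axis has unit norm.
    alphas = evecs[:, keep] / np.sqrt(evals[keep])
    return {"X": X, "sigma": sigma, "alphas": alphas,
            "K_col_mean": K.mean(axis=0), "K_mean": K.mean()}


def kernel_pca_transform(model, Z):
    """Project rows of Z onto the fitted kernel PCA components."""
    Kz = rbf_kernel(Z, model["X"], model["sigma"])
    # Center the test kernel with the training statistics.
    Kz_c = (Kz - Kz.mean(axis=1, keepdims=True)
            - model["K_col_mean"][None, :] + model["K_mean"])
    return Kz_c @ model["alphas"]
```

Sweeping the dimension then amounts to calling `kernel_pca_fit` with different values of `d` and training the same linear classifier on each projected feature set.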

Tables IV and V list the mean classification accuracy and
training time of five kernel classifiers, i.e., kernel SVM, kernel
RMM [16], kernel R-SVM+ [18], kernel R-SVMμ+ [18], and
kernel F-SVM. For kernel methods, the superiority of F-SVM
over the competing methods is more significant. Kernel
F-SVM outperforms kernel SVM on 10, kernel RMM on
9, kernel R-SVM+ [18] on 11, and kernel R-SVMμ+ [18] on
11 of the 11 datasets in terms of classification accuracy. In
terms of training time, kernel F-SVM is a little slower than
kernel SVM, but is about 10^? to 10^? times faster than the
other competing methods.
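
The win counts quoted above can be re-derived from the accuracies in Table IV. The short sketch below copies the table values by hand; the dictionary layout and the helper name `count_wins` are illustrative assumptions, and "R-SVMmu+" simply denotes the second R-SVM column of the table.

```python
# Accuracies (%) copied from Table IV.
# Column order: SVM, RMM, R-SVM+, R-SVMmu+ (second R-SVM column), F-SVM.
TABLE_IV = {
    "Breast cancer": (73.74, 74.15, 71.54, 71.14, 73.95),
    "Diabetes":      (76.83, 74.97, 77.05, 76.80, 78.84),
    "Solar Flare":   (67.64, 66.33, 67.66, 67.54, 67.66),
    "German":        (76.36, 76.58, 76.01, 75.93, 76.90),
    "Heart":         (83.43, 82.19, 83.83, 83.96, 84.25),
    "Image":         (97.14, 96.41, 84.39, 83.97, 96.93),
    "Ringnorm":      (98.41, 86.16, 75.63, 75.43, 98.58),
    "splice":        (90.16, 89.34, 84.85, 84.97, 90.55),
    "Thyroid":       (95.91, 95.86, 91.23, 91.31, 96.13),
    "Twonorm":       (97.59, 95.43, 97.41, 97.39, 97.79),
    "Waveform":      (89.75, 92.60, 88.51, 89.38, 90.95),
}
BASELINES = ("SVM", "RMM", "R-SVM+", "R-SVMmu+")


def count_wins(table, allow_ties):
    """Count the datasets on which F-SVM beats (or ties) each baseline."""
    wins = {name: 0 for name in BASELINES}
    for row in table.values():
        fsvm = row[-1]
        for name, acc in zip(BASELINES, row[:-1]):
            if fsvm > acc or (allow_ties and fsvm == acc):
                wins[name] += 1
    return wins


if __name__ == "__main__":
    print(count_wins(TABLE_IV, allow_ties=True))
    print(count_wins(TABLE_IV, allow_ties=False))
```

With ties counted as wins, the counts are 10, 9, 11, and 11, matching the text. Under a strict comparison the count against R-SVM+ drops to 10, because both methods are listed at 67.66 on Solar Flare.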

C. Results on the LFW Database

In this subsection, the LFW database is used to evaluate
F-SVM for face verification. The database consists of
13,233 face images from 5,749 persons. The face images in
the LFW database were collected from the Internet, and vary


TABLE V

Comparison of the training time (s) of kernel SVM, kernel
RMM [16], kernel R-SVM+ [18], kernel R-SVMμ+ [18], and
kernel F-SVM.

Dataset         SVM           RMM           R-SVM+        R-SVMμ+       F-SVM
Breast cancer   1.90 x 10^?   1.00 x 10^?   2.24 x 10^?   1.83 x 10^?   6.60 x 10^?
Diabetes        1.11 x 10^?   5.70 x 10^?   3.47 x 10^?   5.02 x 10^?   2.57 x 10^?
Solar Flare     1.10 x 10^?   9.01 x 10^?   3.52 x 10^?   4.58 x 10^?   1.80 x 10^?
German          4.70 x 10^?   1.17 x 10^?   5.30 x 10^?   1.83 x 10^?   6.08 x 10^?
Heart           1.00 x 10^?   3.46 x 10^?   5.08 x 10^?   1.1? x 10^?   1.20 x 10^?
Image           5.30 x 10^?   4.31 x 10^?   3.44 x 10^?   1.59 x 10^?   5.98 x 10^?
Ringnorm        3.08 x 10^?   2.93 x 10^?   3.59 x 10^?   6.25 x 10^?   3.75 x 10^?
splice          6.28 x 10^?   6.49 x 10^?   3.13 x 10^?   6.69 x 10^?   6.07 x 10^?
Thyroid         4.22 x 10^?   4.51 x 10^?   4.13 x 10^?   7.13 x 10^?   5.95 x 10^?
Twonorm         2.81 x 10^?   3.26 x 10^?   1.97 x 10^?   2.62 x 10^?   3.83 x 10^?
Waveform        5.79 x 10^?   4.24 x 10^?   1.04 x 10^?   3.18 x 10^?   6.19 x 10^?






Fig. 2. Classification accuracy (%) of kernel SVM and kernel F-SVM under
different kernel PCA dimensions (horizontal axis: dimension).
(a) Breast cancer; (b) Thyroid; (c) Heart; (d) German.


in pose, illumination, expression, and age, making LFW very 
suitable for studying unconstrained face verification. The face 
recognition method can be evaluated with two test protocols 
for LFW: the restricted and the unrestricted settings. Under the 
restricted setting, the only available information is whether 
each pair of training images is matched or not, and the 
performance of the face verification method is evaluated by 
10-fold cross validation on a set of 3000 positive and 3000 
negative image pairs. 
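
The evaluation loop of the restricted protocol can be sketched as below. This is a simplified stand-in, not the pipeline used in the experiments: it assumes each pair has already been reduced to a scalar similarity score, it assumes the 3000 positive and 3000 negative pairs are split into 10 balanced folds, and it replaces classifier training on the nine training folds with a simple threshold search. The names `best_threshold` and `restricted_protocol_accuracy` are hypothetical.

```python
import numpy as np


def best_threshold(scores, labels):
    """Pick the score threshold that maximizes accuracy on the training pairs."""
    candidates = np.unique(scores)
    accs = [np.mean((scores >= t) == labels) for t in candidates]
    return candidates[int(np.argmax(accs))]


def restricted_protocol_accuracy(scores, labels, n_folds=10):
    """10-fold evaluation on matched / mismatched pairs.

    scores: float array of shape (n_folds, pairs_per_fold), larger = more similar.
    labels: bool array of the same shape, True for a matched pair.
    Returns the mean accuracy over folds and its standard error.
    """
    fold_accs = []
    for k in range(n_folds):
        train = np.arange(n_folds) != k          # the other nine folds
        t = best_threshold(scores[train].ravel(), labels[train].ravel())
        fold_accs.append(np.mean((scores[k] >= t) == labels[k]))
    fold_accs = np.asarray(fold_accs)
    return fold_accs.mean(), fold_accs.std(ddof=1) / np.sqrt(n_folds)
```

Only the match / mismatch label of each training pair is used here, which is the defining restriction of this setting. No identity information enters the threshold search.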

In our experiment, we adopt the restricted setting with the
face images aligned by the funneling method [46]. Fig. 3
shows some examples of similar and dissimilar pairs. We
extract two kinds of features for each face image, the SIFT
feature and the attribute feature, and compare F-SVM with SVM,
RMM [16], R-SVM+ and R-SVMμ+ [18], and several representative
face verification methods, including LDML [24], Nowak [47],
V1-like/MKL [48], and MERL+Nowak [49].

Fig. 4 shows the verification accuracies of SVM and F-SVM
under different PCA dimensions by using the attribute feature
and the combined features of SIFT and attributes, respectively.
F-SVM using the combined features of SIFT and attributes
(F-SVM-combined) achieves its best performance of 83.25%
when the dimension d = 300, and F-SVM using the attribute
features (F-SVM-attribute) achieves 82.58% when the dimension
d = 73. SVM using the combined features of SIFT and attributes
(SVM-combined) achieves its best performance of 81.90% when the
dimension d = 400, and SVM using the attribute features
(SVM-attribute) achieves 80.12% when the dimension d = 73.
Thus, F-SVM obtains better accuracy than SVM on the LFW database.

Fig. 3. Examples of face image pairs in the LFW database.
(a) Similar pairs; (b) Dissimilar pairs.

Fig. 4. Verification accuracy of SVM and F-SVM under different PCA
dimensions (horizontal axis: dimension). (a) The attribute features;
(b) The combined features of SIFT and attributes.

We further compare F-SVM with several other face verification
methods. Table VI lists the accuracy of F-SVM, SVM,
RMM [16], R-SVM+ [18], R-SVMμ+ [18], LDML [24], Nowak
[47], V1-like/MKL [48], and MERL+Nowak [49]. We report
the accuracy of F-SVM, SVM, RMM [16], R-SVM+ [18],
R-SVMμ+ [18], and LDML [24] using the attribute and the SIFT
features, the accuracy of Nowak [47] and MERL+Nowak
[49] using SIFT and geometry features, and the accuracy
of V1-like/MKL [48] using the V1-like features. For either
the combined features of SIFT and attributes or the attribute
features alone, F-SVM achieves higher accuracy than SVM, RMM
[16], R-SVM+ [18], R-SVMμ+ [18], and LDML [24], as can be
seen from Table VI. Fig. 5 shows the ROC curves of the competing
methods, from which one can also see that F-SVM-combined
performs better than the other face verification methods.

TABLE VI

Comparison of accuracy obtained using different face
verification methods. The top two results are shown in red
and blue fonts, respectively.

Method, restricted          Accuracy (%)
F-SVM-combined              83.25
F-SVM-attribute             82.58
R-SVM+-combined [18]        81.90
R-SVM+-attribute [18]       81.50
R-SVM+-combined [18]        81.67
R-SVM+-attribute [18]       81.46
RMM-combined [16]           81.30
RMM-attribute [16]          81.27
SVM-combined                81.90
SVM-attribute               80.12
Nowak [47]                  73.93
LDML [24]                   79.27
V1-like/MKL [48]            79.35
MERL+Nowak [49]             76.18

Fig. 5. ROC curves of different face verification methods on the restricted
LFW-funneled database. (a) ROC curves; (b) Cropped and zoomed-in region of (a).

VI. Conclusion

In this paper, we proposed a convex radius-margin based
SVM model (F-SVM) for joint learning of feature transformation
and SVM classifier. For the formulation of F-SVM,
lower and upper bounds of the radius of MEB are introduced
to derive a novel approximation of the radius-margin ratio, and
all the individual inequality constraints are combined into
one integrated inequality constraint, resulting in a convex
relaxation of the radius-margin based SVM model. For model
optimization, a semi-whitened PCA based method is proposed
for the initialization of the learned transformation, and an
alternating minimization algorithm is adopted to learn the
feature transformation and SVM classifier. Further, F-SVM
is kernelized by using kernel PCA. Our experimental results
show that F-SVM obtains higher classification accuracy than
SVM and the state-of-the-art radius-margin based SVM methods,
and is more efficient in training than the other radius-margin
based SVM methods, e.g., RMM [16], R-SVM+ [18]
and R-SVMμ+ [18]. In our future work, we will extend the
proposed relaxed radius-margin based error bound to other
classification methods and extend the proposed model for
learning other forms of feature transformation tailored for
specific applications.


Appendix A

Lemma A.1. $\bar{R} \ge R$.

Proof. Based on the definition of the radius, we get
\[
R^2 = \min_{\mathbf{x}_0} \max_i \|\mathbf{A}\mathbf{x}_i - \mathbf{A}\mathbf{x}_0\|_2^2
\le \max_i \|\mathbf{A}\mathbf{x}_i - \mathbf{A}\bar{\mathbf{x}}\|_2^2
= \bar{R}^2.
\]
□

Denote by $R_p$ the maximum pairwise distance. We have
\[
R_p = \max_{i,j} \{\|\mathbf{A}\mathbf{x}_i - \mathbf{A}\mathbf{x}_j\|_2\}.
\]

Lemma A.2 [18]. $R \ge R_p/2$.

Lemma A.3. $\bar{R} \le R_p$.

Proof. Let $\mathbf{x}'_i = \mathbf{A}\mathbf{x}_i - \mathbf{A}\bar{\mathbf{x}}$. We have
$\mathbf{A}\mathbf{x}_i - \mathbf{A}\mathbf{x}_j = \mathbf{x}'_i - \mathbf{x}'_j$.
Based on the definition of $\bar{R}$,
\[
\bar{R}^2 = \max_i \{\|\mathbf{x}'_i\|_2^2\} = \|\mathbf{x}'_{i^*}\|_2^2,
\]
we will prove that there exists some $j^*$ which makes
$\|\mathbf{x}'_{i^*} - \mathbf{x}'_{j^*}\|_2^2 \ge \|\mathbf{x}'_{i^*}\|_2^2$.
Based on the definition of $\bar{\mathbf{x}}$, we
have $\sum_j \mathbf{x}'_j = \mathbf{0}$. Then, we derive
\[
\sum_j \mathbf{x}'^T_j \mathbf{x}'_{i^*} = 0
\;\Rightarrow\;
\min_j \{\mathbf{x}'^T_j \mathbf{x}'_{i^*}\} = \mathbf{x}'^T_{j^*} \mathbf{x}'_{i^*} \le 0.
\]
Since $\|\mathbf{x}'_{j^*}\|_2^2 \ge 0$ and $-2\mathbf{x}'^T_{j^*}\mathbf{x}'_{i^*} \ge 0$,
one can easily see that
\[
\|\mathbf{x}'_{i^*} - \mathbf{x}'_{j^*}\|_2^2 \ge \|\mathbf{x}'_{i^*}\|_2^2.
\]
Based on the definition of $R_p$,
\[
R_p^2 \ge \|\mathbf{x}'_{i^*} - \mathbf{x}'_{j^*}\|_2^2.
\]
Combining the two inequalities above, we can prove
$\bar{R} \le R_p$. □

Finally, by combining Lemmas A.1 to A.3, we obtain the following
theorem:

Theorem A.1. The radius $R$ is bounded in terms of $\bar{R}$ by
\[
\tfrac{1}{2}\bar{R} \le R \le \bar{R}.
\]

Appendix B

Lemma B.1 [4?]. Given two symmetric positive definite
(SPD) matrices A and B, we have



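\[
(\mathbf{A} + \mathbf{B})^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1}\left(\mathbf{A}^{-1} + \mathbf{B}^{-1}\right)^{-1}\mathbf{A}^{-1}, \tag{28}
\]
\[
(\mathbf{A} + \mathbf{B})^{-1} = \mathbf{A}^{-1}\left(\mathbf{A}^{-1} + \mathbf{B}^{-1}\right)^{-1}\mathbf{B}^{-1}
= \mathbf{B}^{-1}\left(\mathbf{A}^{-1} + \mathbf{B}^{-1}\right)^{-1}\mathbf{A}^{-1}. \tag{29}
\]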
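
As a quick sanity check, the identities of Lemma B.1 can be verified numerically on random SPD matrices. The sketch below is not part of the proof; the helper names `random_spd` and `lemma_b1_residuals` and the way the test matrices are generated are assumptions made for illustration.

```python
import numpy as np


def random_spd(rng, d):
    """Random symmetric positive definite matrix (G G^T + I)."""
    G = rng.normal(size=(d, d))
    return G @ G.T + np.eye(d)


def lemma_b1_residuals(d=5, seed=0):
    """Max absolute deviation of each right-hand side from (A + B)^{-1}."""
    rng = np.random.default_rng(seed)
    A, B = random_spd(rng, d), random_spd(rng, d)
    inv = np.linalg.inv
    lhs = inv(A + B)
    mid = inv(inv(A) + inv(B))                 # (A^{-1} + B^{-1})^{-1}
    eq28 = inv(A) - inv(A) @ mid @ inv(A)      # right-hand side of Eq. (28)
    eq29a = inv(A) @ mid @ inv(B)              # first form in Eq. (29)
    eq29b = inv(B) @ mid @ inv(A)              # second form in Eq. (29)
    return tuple(float(np.abs(lhs - rhs).max()) for rhs in (eq28, eq29a, eq29b))


if __name__ == "__main__":
    print(lemma_b1_residuals())
```

All three residuals should be at the level of floating-point rounding error.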


Theorem B.2. The problem in Eq. (?) is a convex optimization
problem.
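
The proof hinges on the convexity of $\mathbf{w}^T\mathbf{M}^{-1}\mathbf{w}$ over $\mathbf{M} \succ 0$. Before going through the algebra, this property can be probed numerically with random trials. The sketch below is only a sanity check under assumed random test data, not a substitute for the proof, and the helper names are hypothetical.

```python
import numpy as np


def frac(w, M):
    """Evaluate w^T M^{-1} w for an SPD matrix M."""
    return float(w @ np.linalg.solve(M, w))


def random_spd(rng, d):
    """Random well-conditioned SPD matrix."""
    G = rng.normal(size=(d, d))
    return G @ G.T + d * np.eye(d)


def max_convexity_violation(trials=1000, d=4, seed=0):
    """Largest value of f(convex combination) - convex combination of f.

    For a convex function this is never positive, up to rounding error.
    """
    rng = np.random.default_rng(seed)
    worst = -np.inf
    for _ in range(trials):
        w1, w2 = rng.normal(size=d), rng.normal(size=d)
        M1, M2 = random_spd(rng, d), random_spd(rng, d)
        theta = rng.uniform(0.0, 1.0)
        combined = frac(theta * w1 + (1 - theta) * w2,
                        theta * M1 + (1 - theta) * M2)
        averaged = theta * frac(w1, M1) + (1 - theta) * frac(w2, M2)
        worst = max(worst, combined - averaged)
    return worst


if __name__ == "__main__":
    print(max_convexity_violation())
```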


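Proof. Note that all the constraints define a convex set, and
$\sum_i \xi_i$ and $\mathrm{tr}(\mathbf{M}\mathbf{S})$ are linear in $\boldsymbol{\xi}$ and $\mathbf{M}$, respectively. The
key step is then to prove that the function $\mathbf{w}^T\mathbf{M}^{-1}\mathbf{w}$ is convex
for $\mathbf{M} \succ 0$, i.e., for any $1 > \theta > 0$,
\[
\theta\,\mathbf{w}_1^T\mathbf{M}_1^{-1}\mathbf{w}_1 + (1-\theta)\,\mathbf{w}_2^T\mathbf{M}_2^{-1}\mathbf{w}_2 \ge
(\theta\mathbf{w}_1 + (1-\theta)\mathbf{w}_2)^T(\theta\mathbf{M}_1 + (1-\theta)\mathbf{M}_2)^{-1}(\theta\mathbf{w}_1 + (1-\theta)\mathbf{w}_2).
\]
Note that the right-hand side contains three kinds of terms:
\[
\theta^2\,\mathbf{w}_1^T(\theta\mathbf{M}_1 + (1-\theta)\mathbf{M}_2)^{-1}\mathbf{w}_1, \qquad
(1-\theta)^2\,\mathbf{w}_2^T(\theta\mathbf{M}_1 + (1-\theta)\mathbf{M}_2)^{-1}\mathbf{w}_2, \qquad
\theta(1-\theta)\,\mathbf{w}_1^T(\theta\mathbf{M}_1 + (1-\theta)\mathbf{M}_2)^{-1}\mathbf{w}_2,
\]
where the cross term enters the expansion twice by symmetry.

First, applying Eq. (28) with $\mathbf{A} = \mathbf{M}_1$ and $\mathbf{B} = \frac{1-\theta}{\theta}\mathbf{M}_2$, we have
\[
\begin{aligned}
\theta^2(\theta\mathbf{M}_1 + (1-\theta)\mathbf{M}_2)^{-1}
&= \theta\left(\mathbf{M}_1 + \frac{1-\theta}{\theta}\mathbf{M}_2\right)^{-1} \\
&= \theta\left(\mathbf{M}_1^{-1} - \mathbf{M}_1^{-1}\left(\mathbf{M}_1^{-1} + \left(\frac{1-\theta}{\theta}\mathbf{M}_2\right)^{-1}\right)^{-1}\mathbf{M}_1^{-1}\right) \\
&= \theta\mathbf{M}_1^{-1} - \mathbf{M}_1^{-1}\left((\theta\mathbf{M}_1)^{-1} + ((1-\theta)\mathbf{M}_2)^{-1}\right)^{-1}\mathbf{M}_1^{-1},
\end{aligned}
\tag{30}
\]
and then we get
\[
\theta\,\mathbf{w}_1^T\mathbf{M}_1^{-1}\mathbf{w}_1 - \theta^2\,\mathbf{w}_1^T(\theta\mathbf{M}_1 + (1-\theta)\mathbf{M}_2)^{-1}\mathbf{w}_1
= \mathbf{w}_1^T\mathbf{M}_1^{-1}\left((\theta\mathbf{M}_1)^{-1} + ((1-\theta)\mathbf{M}_2)^{-1}\right)^{-1}\mathbf{M}_1^{-1}\mathbf{w}_1.
\tag{31}
\]
Analogously, we get
\[
(1-\theta)\,\mathbf{w}_2^T\mathbf{M}_2^{-1}\mathbf{w}_2 - (1-\theta)^2\,\mathbf{w}_2^T(\theta\mathbf{M}_1 + (1-\theta)\mathbf{M}_2)^{-1}\mathbf{w}_2
= \mathbf{w}_2^T\mathbf{M}_2^{-1}\left((\theta\mathbf{M}_1)^{-1} + ((1-\theta)\mathbf{M}_2)^{-1}\right)^{-1}\mathbf{M}_2^{-1}\mathbf{w}_2.
\tag{32}
\]
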


With Eq. (29), we have